Random-Forest-Based Analysis of URL Paths
Authors
Abstract
One of the key sources of malware spread is malicious web sites, which either trick the user into installing malware that imitates legitimate software or, in the case of various exploit kits, initiate malware installation without any user action. The most common countermeasure against such web sites is blacklisting. However, blacklisting provides little to no information about new sites that have never been seen before. Therefore, there has been substantial research into predicting malicious web sites based on their features. This work-in-progress paper presents a lightweight prediction method that uses solely lexical features of the site URL and classification by random forests. To this end, three approaches to feature extraction have been elaborated and investigated on real-world data sets with respect to precision and recall. The obtained results indicate that there is almost never a significant difference between the considered methods, and that, in spite of the limitation to lexical features of the site URL, they achieve an impressive performance in terms of area under the precision-recall curve for the path parts of URLs.
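To make the idea of lexical features of a URL path concrete, the following is a minimal sketch and not the feature extraction used in the paper, whose three variants are not specified in this abstract. The function names `path_lexical_features` and `average_precision` and the particular features (path length, segment count, digit ratio, and so on) are assumptions chosen purely for illustration. The second function estimates the area under the precision-recall curve as average precision over a ranked list of classifier scores.

```python
from urllib.parse import urlsplit


def path_lexical_features(url: str) -> dict[str, float]:
    """Compute a few illustrative (hypothetical) lexical features of a URL path."""
    path = urlsplit(url).path
    segments = [s for s in path.split("/") if s]
    n_chars = len(path)
    n_digits = sum(ch.isdigit() for ch in path)
    return {
        "path_length": float(n_chars),
        "num_segments": float(len(segments)),
        "max_segment_length": float(max((len(s) for s in segments), default=0)),
        "digit_ratio": n_digits / n_chars if n_chars else 0.0,
        "num_dots": float(path.count(".")),
    }


def average_precision(labels: list[int], scores: list[float]) -> float:
    """Estimate the area under the precision-recall curve as average precision.

    labels: 1 for malicious, 0 for benign; scores: higher means more suspicious.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(labels)
    if n_pos == 0:
        return 0.0
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at each recalled positive
    return total / n_pos
```

Feature vectors of this kind could then be passed to any random forest implementation, for example scikit-learn's `RandomForestClassifier`, and the resulting class probabilities scored with `average_precision`.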